Let’s display summaries of the data set in several ways to get a sense of the data.
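As a rough sketch, the summaries that follow could be produced with base R calls like the ones below. To keep the example self-contained it rebuilds only the first six rows shown by `head()` in this report. The commented-out `read.csv()` call and its file path are hypothetical stand-ins for however the full 1599-row file is actually loaded.

```r
# First six rows as printed by head() in this report.
# The full data set has 1599 rows and is not reproduced here.
wine <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)
# With the real file this would be something like (hypothetical path):
# wine <- read.csv("path/to/red-wine-data.csv")

dim(wine)      # number of rows and columns
names(wine)    # column names
str(wine)      # type and first values of each column
head(wine)     # first six rows
summary(wine)  # min, quartiles, median, mean and max per column
```

On the six-row sample the numbers will of course differ from the full-data output shown in this report.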
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Let’s first look at the distribution of each variable.
Some variables look approximately normally distributed, while others clearly do not. This is discussed further in the analysis.
For now, it is clear that the Residual Sugar, Chlorides and Total Sulfur Dioxide distributions have very long tails. Let’s plot them on a log10 scale to see how they look.
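A minimal sketch of the log10 idea, using only the first ten residual sugar values printed by `str()` in this report. It uses base graphics, which is my choice for the example and not necessarily how the plots in this report were drawn.

```r
# First ten residual.sugar values as printed by str() in this report (g/L).
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# log10 compresses the right tail: 6.1 sits far from the bulk on the raw
# scale, but much closer to it on the log scale.
log.sugar <- log10(residual.sugar)

# Distance from the median to the largest value, before and after.
raw.gap <- max(residual.sugar) - median(residual.sugar)
log.gap <- max(log.sugar) - median(log.sugar)
print(c(raw = raw.gap, log10 = log.gap))

# Histogram on the transformed scale.
hist(log.sugar, main = "Residual sugar (log10 scale)", xlab = "log10(g/L)")
```

The same transformation would apply to chlorides and total sulfur dioxide, since all of their values are positive.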
The data set is tidy. The first column holds a unique ID for each wine. The remaining variables are measurements of the wine’s chemical composition, except the last one, which is the quality assessed for the wine.
Quality is an integer that can take values from 0 to 10. None of the wines in the data set scores lower than 3 or higher than 8. The quality scores seem roughly normally distributed.
The main feature is quality. The main interest is in knowing which other variables are correlated with wine quality. It may also be of interest to see which variables are correlated with each other, independently of how they affect quality.
In particular I am interested in Residual Sugar, Alcohol and Citric Acid, since from my perspective these are among the more palpable features that could affect a wine’s taste and perceived quality.
All other features could play a factor in the perceived quality of the wine.
Yes, I created a variable named “Category”, which groups quality into three values: High (scores 7 and 8), Medium (5 and 6) and Low (3 and 4). This groups the wines by quality level so that their statistics for the main features can be compared at that level of granularity.
Some of the variables have a clearly normal-looking distribution, such as Density and pH, as do the acidity variables (volatile and fixed), although those two have a slightly long tail.
Chlorides and Residual Sugar have very long tails, which means that the majority of the wines have low amounts of salt and sugar respectively, and very few have high amounts. Sulfur dioxide behaves similarly. That makes sense, since you would want enough of it to prevent oxidation but not so much that it is perceived in the taste.
Interesting cases to me are alcohol and citric acid. Alcohol levels are more widely spread, although it is still clear that fewer wines have higher amounts of alcohol. The clear peak is between 9 and 10 percent.
For Citric Acid we can almost observe three peaks: one close to 0 grams per liter, another around 0.25 and one more at 0.5. I wonder if those fixed amounts are typical for wines defined by some other criteria.
Quality seems to have a typical normal distribution with a mean of 5.6 and a median of 6. I found it interesting that none of the wines was rated higher than 8 or lower than 3.
I wonder how the distributions of some of these variables change depending on the quality.
There seem to be a few differences in the distributions of Alcohol and Citric Acid depending on the quality.
Let’s subset the wines into medium quality (5-6) and high quality (7-8) and look at their summary statistics for each of these features. For this I will add a “Category” column that indicates the quality in three possible values: High (scores 7 and 8), Medium (5 and 6) and Low (3 and 4). The summaries that follow are listed in the order High, Medium, Low, first for alcohol and then for citric acid.
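One way to build such a column is with `cut()`. This is a sketch on the first ten quality and alcohol values printed by `str()` in this report; the break points are my reading of the Low/Medium/High definition above, and the vector names are only for illustration.

```r
# First ten values as printed by str() in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Intervals are right-closed by default: (2,4] -> Low, (4,6] -> Medium,
# (6,8] -> High. Scores outside 3..8 would become NA.
category <- cut(quality,
                breaks = c(2, 4, 6, 8),
                labels = c("Low", "Medium", "High"))

table(category)                            # wines per category
mean.by.category <- tapply(alcohol, category, mean)
print(mean.by.category)                    # empty categories show NA

summary(alcohol[category == "High"])       # same kind of summary as shown in this report
```

In this ten-row sample there are no Low wines, so that group is empty. On the full data the same calls give one summary per category.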
## alcohol
## Min. : 9.20
## 1st Qu.:10.80
## Median :11.60
## Mean :11.52
## 3rd Qu.:12.20
## Max. :14.00
## alcohol
## Min. : 8.40
## 1st Qu.: 9.50
## Median :10.00
## Mean :10.25
## 3rd Qu.:10.90
## Max. :14.90
## alcohol
## Min. : 8.40
## 1st Qu.: 9.60
## Median :10.00
## Mean :10.22
## 3rd Qu.:11.00
## Max. :13.10
## citric.acid
## Min. :0.0000
## 1st Qu.:0.3000
## Median :0.4000
## Mean :0.3765
## 3rd Qu.:0.4900
## Max. :0.7600
## citric.acid
## Min. :0.0000
## 1st Qu.:0.0900
## Median :0.2400
## Mean :0.2583
## 3rd Qu.:0.4000
## Max. :0.7900
## citric.acid
## Min. :0.0000
## 1st Qu.:0.0200
## Median :0.0800
## Mean :0.1737
## 3rd Qu.:0.2700
## Max. :1.0000
Now that I have made the groups, I would like to see the distribution of the main variables that interest me: alcohol and citric acid.
It is hard to tell because there are only a few wines of low (3, 4) or high (7, 8) quality, but there seems to be a difference in the distribution of alcohol levels for wines rated higher than 5. Higher-quality wines tend to have higher levels of alcohol: wines with quality 5 and 6 have a mean of 10.25%, while wines with quality 7 and 8 have a mean of 11.52%. This is also supported by the box plots, where we can see the median values for each of the three quality categories.
Sugar, on the other hand, seems to have the same distribution across wines of all qualities, peaking between two and three grams per liter.
The citric acid distribution seems about the same for wines with quality 5 and 6. For higher-quality wines, however, the initial peak close to 0 is much reduced, leaving the majority of wines with amounts between 0.25 and 0.50 grams per liter. Wines with 5-6 quality have a mean of 0.26, while those with 7-8 quality have a mean of 0.38. Again, this is depicted by the box plot and the median values.
Now let’s inspect the correlations among these variables.
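A correlation matrix like the one in this report can be obtained with `cor()`, which computes Pearson correlations by default. The sketch below uses a handful of columns and only the first ten rows printed by `str()`, so its numbers will not match the full-data matrix. Note that the matrix in this report has no `X` column, so the ID was presumably dropped before calling `cor()`.

```r
# First ten rows of a few columns, as printed by str() in this report.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all numeric columns.
cor.matrix <- cor(wine)
print(round(cor.matrix, 2))

# On the full data frame, the ID column could be excluded first, e.g.:
# cor(subset(wine, select = -X))
```

The result is symmetric with ones on the diagonal, which is why each pair appears twice in the printed matrix.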
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
At first glance there do not seem to be any very strong correlations between variables, and certainly not between quality and any other variable. But let’s look at scatter plots of the strongest relationships between quality and the other variables.
There are clear vertical lines for quality, since it is a discrete variable.
Let’s look at the means for each quality score, and plot the other strongly correlated variables against quality.
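A small sketch of the mean-per-score idea, again on the first ten quality and alcohol values printed by `str()` in this report. It uses `tapply()` and base graphics, which are my choices for the example.

```r
# First ten values as printed by str() in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# One mean per distinct quality score; names(mean.alcohol) are the scores.
mean.alcohol <- tapply(alcohol, quality, mean)
print(round(mean.alcohol, 2))

# Plotting the means removes the vertical striping of the raw scatter plot.
plot(as.integer(names(mean.alcohol)), mean.alcohol, type = "b",
     xlab = "Quality score", ylab = "Mean alcohol (%)")
```

Only scores 5, 6 and 7 occur in these ten rows. The full data would give one point for each score from 3 to 8.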
Besides the correlations between Quality and those variables, there are in fact stronger correlations among the other variables, which were of less interest. Some are positive, some negative.
There are no strong correlations (absolute value >= 0.7) between quality and the other features, but Quality seems to be moderately correlated with some of them, in particular with Alcohol. I more or less expected that correlation, as well as the one with citric acid. As the quality increases, so do the levels of Alcohol and Citric Acid. Although this does not mean that those substances increase the quality of the wine, I would expect some relationship, since alcohol and the freshness given by citric acid are essential parts of the flavour of a wine.
Average alcohol levels start at around 10 for lower-quality wines and end at around 12 for the highest-rated wines. I think the slight increase and then decrease observed around a quality score of 4 accounts for the rather moderate correlation.
Citric acid rises with quality, from around 0.2 to 0.37. There are no downward trends, but from quality 5 to 8 the slope is a bit less steep than for alcohol.
Sulphates have a low positive correlation with quality, while volatile acidity has a negative correlation, with a clear decline in the volatile acidity level as wine quality goes from 3 to 7.
An interesting, albeit expected, strong negative correlation is the one between pH and Fixed Acidity. As the pH increases, the wine becomes less acidic and the fixed acidity decreases.
Free and Total sulfur dioxide are positively correlated, which is also expected since the free form is a subset of the total sulfur dioxide.
Citric Acid has an interesting relationship with the acidity levels. It has a positive correlation with fixed acidity and a negative correlation with volatile acidity. This is explained by this Wikipedia article: https://en.wikipedia.org/wiki/Acids_in_wine#Citric_acid, which clarifies that citric acid is in fact a fixed acid. I speculate that as winemakers add more fixed acids they reduce the volatile acids. This is somewhat supported by the negative correlation between these two types of acid.
The strongest relationship was between pH and Fixed Acidity (-0.68297819), but as explained before, that is just because of the nature of those variables. The strongest correlation involving Quality was with Alcohol, at 0.47616632. I was surprised at the low correlation between Residual Sugar and Quality, which was only 0.01373164.
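To make the comparison concrete, the quality column of the correlation matrix printed earlier in this report can be ranked by absolute value. The values below are copied from that matrix; the vector name is just for illustration.

```r
# Correlation of each variable with quality, copied from the matrix
# printed earlier in this report.
quality.cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by strength of the relationship, ignoring the sign.
ranked <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
print(ranked)
```

Alcohol comes first and residual sugar last, with volatile acidity as the strongest negative relationship.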
I am mostly interested in seeing the same correlated variables by wine quality category to see if there are any big differences.
For most of the correlated variables we can observe that the same mean value trends are found across all wine quality categories.
Perhaps the exceptions are High and Low quality wines when comparing Total vs Free sulfur dioxide, where their means are somewhat far from the overall mean. I think a few wines in those categories have atypical amounts of sulfur dioxide: looking at the scatter points behind, we can see that it is mostly lone wines far from the average that pull the mean for that category.
Not particularly; they seem to follow the expected trend that was seen when looking at just two variables.
None at this time.
Distribution of Alcohol, the variable most correlated with quality. We define wine quality categories of Low (scores 3 and 4), Medium (scores 5 and 6) and High (scores 7 and 8).
Most of the wines rated fall in the Medium quality category. Looking at the overall distribution, Low quality wines have alcohol levels on the lower side of the scale, while High quality ones have higher counts at the higher end of the scale.
We focus on the correlation of Alcohol, the variable most correlated with wine quality scores. We define wine quality categories of Low (scores 3 and 4), Medium (scores 5 and 6) and High (scores 7 and 8).
We can observe a tendency for higher-quality wines to have higher levels of alcohol. In particular, there is an observable difference between Medium and High quality wines. The mean level of alcohol increases from just under 10% to slightly above 12% between scores 5 and 8, while the median increases from 10% to 11.6% between Medium and High quality wines.
We analyze the change in average Citric Acid levels as alcohol levels increase. Generally, Medium quality wines follow the same trend as the overall average. High quality wines have considerably less citric acid when the alcohol rises above 12%. Low quality wines seem to have fairly low levels of citric acid overall, with a downward trend up to 11% alcohol and a slight increase afterwards.
The objective was to find the variables that may have an impact on, or be tightly correlated with, wine quality. I was expecting higher correlations with quality than the results showed. Looking at the correlation graph, it was clear that there were no such high correlations.
Fortunately, Alcohol and Citric Acid, while not having a high correlation score, do show a trend that suggests higher levels of both in higher-quality wines.
Looking at the correlations between other variables, I was only able to understand how those substances or variables interact with and affect each other. It was difficult to establish how any one of them, or any combination of them, may have an impact on quality.
It would be interesting to explore the data further if it had more categorical variables to work with. Things like the region these wines come from, the weather conditions in those regions and grape-growing variables such as humidity could enrich the data and the analysis, and help to understand more of what goes into crafting a good bottle of wine.